The Internet Archive's Web Archive began in late 1996. It is made available through the Wayback Machine, and some collections are available in bulk to researchers. Many pages are archived by the Internet Archive on behalf of other contributors, including Archive-It partners and Save Page Now users. Other captures are donated to the Internet Archive by partners such as Alexa Internet.
Topic: Web Archive
The Internet Archive discovers and captures web pages through many different web crawls. At any given time several distinct crawls are running, some lasting months and others repeating every day or at longer intervals. View the web archive through the Wayback Machine.
Topic: webwidecrawl
Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites. Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for...
Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.
Topics: web crawl, Alexa
This library contains digital images uploaded by Archive users which range from maps to astronomical imagery to photographs of artwork. Many of these images are available for free download.
Topic: images
Survey crawls are run about twice a year, on average, and attempt to capture the content of the front page of every web host ever seen by the Internet Archive since 1996.
Topic: survey crawls
Content crawled via the Wayback Machine Live Proxy, mostly by the Save Page Now feature on web.archive.org. The liveweb proxy is a component of the Internet Archive's Wayback Machine project. It captures the content of a web page in real time, archives it into an ARC or WARC file, and returns the ARC/WARC record to the Wayback Machine for processing. The recorded ARC/WARC file becomes part of the Wayback Machine in due course.
Download or listen to free music and audio. This library contains recordings ranging from alternative news programming, to Grateful Dead concerts, to Old Time Radio shows, to book and poetry readings, to original music uploaded by our users. Many of these audio files and MP3s are available for free download. Check our FAQ for more information. Contribute Your Audio: Please feel free to upload your audio (Uploaders, please set a Creative Commons license as part of the upload process, so people know...
Topic: Audio
8.5B · Dec 16, 2004 · by Internet Archive
The Internet Archive offers over 20,000,000 freely downloadable books and texts. There is also a collection of 2.3 million modern eBooks that may be borrowed by anyone with a free archive.org account. Borrow a Book: Books on Internet Archive are offered in many formats, including DAISY files intended for print-disabled people. In addition to the collections here, print-disabled people may access a large collection of modern books provided as encrypted DAISY files on...
Topics: Texts, Kindle, Ebook, Nook, Books, Documents
Archive-It is a subscription web archiving service of the Internet Archive that helps organizations harvest, build, and preserve collections of digital content. Partners create domain-specific collections of web captures that can be searched on Archive-It. Content is hosted and stored at the Internet Archive data centers. Archive-It works with more than 400 partner organizations in 48 U.S. states and 16 countries worldwide including: College and University Libraries; State Archives, Libraries,...
Topic: Colleges, Universities, Libraries, Archives, NGOs, Museums
Archive-It is the leading web archiving service for collecting and accessing cultural heritage on the web and is a service of Internet Archive used by libraries, archives, governments, non-profits, and other organizations to build collections of web materials.
Topic: TK
Download or listen to free movies, films, and videos. This library contains digital movies uploaded by Archive users which range from classic full-length films, to daily alternative news broadcasts, to cartoons and concerts. Many of these videos are available for free download. Check our FAQ for more information. Contribute Your Movies and Video: Please feel free to upload your movies (Uploaders, please set a Creative Commons license as part of the upload process, so people know what they can do...
Topic: Moving Images
5.6B · Dec 14, 2005 · by Community Audio
You are invited to view or upload audio to the Community collection. These thousands of recordings were all contributed by Archive users and community members. Please select a Creative Commons License during upload so that others will know what they may (or may not) do with your audio. Click here to contribute your audio! Browse by style: Blues, Country, Electronic, Experimental, Hiphop, Indie, Jazz, Rock, Spoken Word.
5.2B · Nov 4, 2011 · by Internet Archive
Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain.
Topic: webcrawl
Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is composed entirely of volunteers and interested parties, and has expanded into a large number of related projects for saving online and digital history. History is littered with hundreds of conflicts over the future of a community, group, location or...
These crawls are part of an effort to archive pages as they are created and to archive the pages that they refer to. That way, as the referenced pages are changed or taken off the web, a link to the version that was live when the page was written will be preserved. The Internet Archive hopes that references to these archived pages will then be put in place of a link that would otherwise be broken, or added as a companion link to allow people to see what was originally intended by a page's...
A daily collection of thousands of the most popular web sites according to Alexa.com's top sites rankings.
Topics: daily, popular sites, Alexa
3.1B · Feb 26, 2005 · by Internet Archive
You are invited to view or upload your videos to the Community collection. These thousands of videos were contributed by Archive users and community members. These videos are available for free download. Please select a Creative Commons License during upload so that others will know what they may (or may not) do with your video. Click here to upload your video!
Topic: Moving Images
The seed for Wide00014 was:
- Slash pages from every domain on the web:
-- a list of domains using Survey crawl seeds
-- a list of domains using Wide00012 web graph
-- a list of domains using Wide00013 web graph
- Top ranked pages (up to a max of 100) from every linked-to domain using the Wide00012 inter-domain navigational link graph
-- a ranking of all URLs that have more than one incoming inter-domain link (rank was determined by number of incoming links using Wide00012 inter domain links)...
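The ranking step described above (count incoming inter-domain links per URL, keep URLs with more than one, and take up to 100 per linked-to domain) can be illustrated with a small sketch. This is not the Internet Archive's actual pipeline: the function names, the edge-list input format, and the use of the URL hostname as the "domain" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def host(url):
    """Hostname of a URL; used here as a stand-in for its 'domain' (an assumption)."""
    return urlsplit(url).hostname or ""


def top_pages_per_domain(edges, max_per_domain=100, min_links=2):
    """Rank linked-to URLs by their number of incoming inter-domain links.

    edges: iterable of (source_url, target_url) pairs (hypothetical input format).
    Returns {domain: [url, ...]} with at most max_per_domain URLs per domain,
    highest link count first; ties are broken alphabetically by URL.
    """
    incoming = Counter()
    for src, dst in edges:
        if host(src) != host(dst):  # count only links that cross domains
            incoming[dst] += 1

    by_domain = defaultdict(list)
    for url, count in incoming.items():
        if count >= min_links:  # "more than one incoming inter-domain link"
            by_domain[host(url)].append((count, url))

    return {
        domain: [u for _, u in sorted(pages, key=lambda p: (-p[0], p[1]))[:max_per_domain]]
        for domain, pages in by_domain.items()
    }
```

Calling `top_pages_per_domain(edges)` on a list of link pairs returns, for each linked-to host, its most-linked pages in rank order.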
Collections of Wiki data
Topics: crawls, data, wiki
Crawl of outlinks from wikipedia.org. These files are currently not publicly accessible. From Wikipedia: Wikipedia is a multilingual, web-based, free-content encyclopedia project operated by the Wikimedia Foundation and based on an openly editable model. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links to guide the...
ArchiveBot is an IRC bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, records it in a WARC, and then uploads that WARC to ArchiveTeam servers for eventual injection into the Internet Archive (or other archive sites). To use ArchiveBot, drop by #archivebot on EFNet. To interact with ArchiveBot, you issue commands by typing them into the channel. Note you will need channel...
Topics: archiveteam, archivebot, webcrawl, robot, love
2.4B · Jan 18, 2005 · by Internet Archive
Texts contributed by the community. Click here to contribute your book! For more information and how-to, please see help.archive.org/hc/en-us/articles/360002360111-Uploading-A-Basic-Guide. Uploaders, please note: Archive.org supports metadata about items in just about any language so long as the characters are UTF-8 encoded. Find books by language: Afar Books, Afrikaans Books, Akan Books, Albanian Books, Arabic Books, Armenian Books, Aymara Books, Azerbaijan Books, Balochi Books, Bambara Books, Bangla Books...
Topic: Texts
The American Libraries collection includes material contributed from across the United States. Institutions range from the Library of Congress to many local public libraries. As a whole, this collection of material brings holdings that cover many facets of American life and scholarship into the public domain. Significant portions of this collection have been generously sponsored by Microsoft, Yahoo!, The Sloan Foundation, and others.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Wide17 was seeded with the "Total Domains" list of 256,796,456 URLs provided by Domains Index on June 26th, and crawled with max-hops set to "3" and de-duplication set "on".
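To illustrate what a hop limit means for a crawl like this, here is a minimal sketch of a breadth-first traversal that stops following links once a URL is `max_hops` links away from a seed. It is a toy model over an in-memory link graph, not the crawler's implementation: the function name and the `links` mapping are hypothetical, embedded resources are ignored, and the seen-set only prevents revisiting a URL (it is not meant to model the crawler's de-duplication setting).

```python
from collections import deque


def crawl_with_max_hops(seeds, links, max_hops):
    """Return {url: hops_from_nearest_seed} for URLs within max_hops of a seed.

    seeds: iterable of starting URLs.
    links: dict mapping a URL to the URLs it links to (a toy link graph).
    """
    hops_for = {}
    queue = deque()
    for seed in seeds:
        if seed not in hops_for:
            hops_for[seed] = 0
            queue.append(seed)

    while queue:
        url = queue.popleft()
        hops = hops_for[url]
        if hops >= max_hops:
            continue  # at the hop limit: keep the page, do not follow its links
        for target in links.get(url, ()):
            if target not in hops_for:  # visit each URL at most once
                hops_for[target] = hops + 1
                queue.append(target)
    return hops_for


toy_links = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/2"],
    "a.example/2": ["a.example/3"],
    "a.example/3": ["a.example/4"],
}
reached = crawl_with_max_hops(["a.example/"], toy_links, max_hops=3)
```

With `max_hops=3`, `a.example/3` is reached at exactly three hops, while `a.example/4` lies four hops out and is never visited.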
2.1B · Apr 8, 2011 · by Internet Archive
Large-scale web harvests and national domain crawls performed for National Libraries, National Archives, preservation partners, research initiatives, and as part of special projects and custom crawling and research services.
Topic: ccs
Listen to free audio books and poetry recordings! This library of audio books and poetry features digital recordings and MP3's from the Naropa Poetics Audio Archive, LibriVox, Project Gutenberg, Maria Lectrix, and Internet Archive users.
A collection of data and miscellaneous media donated by individuals to the Internet Archive.
LibriVox - founded in 2005 - is a community of volunteers from all over the world who record public domain texts: poetry, short stories, whole books, even dramatic works, in many different languages. All LibriVox recordings are in the public domain in the USA and available as free downloads on the internet. If you are not in the USA, please check your country's copyright law before downloading. Please visit the LibriVox website where you can search for books that interest you. You can search or...
This is a collection of web page captures from links added to, or changed on, Wikipedia pages. The idea is to bring reliability to Wikipedia outlinks so that if the pages referenced by Wikipedia articles are changed, or go away, a reader can permanently find what was originally referred to. This is part of the Internet Archive's attempt to rid the web of broken links.
Topics: Wikipedia, Wikimedia
This data is currently not publicly accessible.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Additional collections of scanned books, articles, and other texts (usually organized by topic) are presented here.
Web wide crawl with initial seedlist and crawler configuration from April 2013.
miscellaneous data
Topic: brad tofel
Web wide crawl with initial seedlist and crawler configuration from January 2015.
A daily collection of hundreds of the world's top news sites.
Topics: daily, news
Web wide crawl number 16. The seed list for Wide00016 was made from the join of the top 1 million domains from Cisco and the top 1 million domains from Alexa.
Web wide crawl with initial seedlist and crawler configuration from June 2014.
The seeds for this crawl came from:
- 251 million domains that had at least one link from a different domain in the Wayback Machine, across all time
- ~300 million domains that we had in the Wayback Machine, across all time
- 55,945,067 domains from https://archive.org/details/wide00016
This crawl was run with a Heritrix setting of "maxHops=0" (URLs including their embeds). The WARC files associated with this crawl are not currently available to the general public.
A daily crawl of more than 200,000 home pages of news sites, including the pages linked from those home pages. Site list provided by The GDELT Project.
Topics: GDELT, News
Wayback indexes. This data is currently not publicly accessible.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
The Internet Archive Software Collection is the largest vintage and historical software library in the world, providing instant access to millions of programs, CD-ROM images, documentation and multimedia. The collection includes a broad range of software related materials including shareware, freeware, video news releases about software titles, speed runs of actual software game play, previews and promos for software games, high-score and skill replays of various game genres, and the art of...
This is a Collection of URLs (and Outlinked URLs) extracted from a random feed of 1% of all Tweets.